ABSTRACT

We propose an error concealment scheme for MPEG-2 compressed (AAC)
audio using a novel modulo watermarking technique. It can be used on
top of other error control schemes. After the modulo watermark is
embedded, an MPEG-2 AAC audio file shows only a negligible increase in
file size and a moderate SNR penalty. For audio transmission over
packet-switched networks (e.g., the Internet), our watermark-based
concealment scheme shows a consistent SNR gain over conventional
concealment schemes.

1.  INTRODUCTION 

Reliable transmission of digital audio over packet-switched networks
(e.g., the Internet) is a challenging task because the Internet is a
best-effort network that offers no QoS guarantee. Although channel
coding can be used to protect the audio from packet loss, it usually
introduces extra redundancy/payload. On the other hand, error
concealment [1, 2, 3, 4], which typically extracts features from the
received audio and uses them to recover the lost data, is very
attractive in audio transmission as it improves the perceptual quality
without the need for additional payload.

There are two issues in error concealment: complexity of the receiver
and inaccurate extraction of enhancement features at the decoder. Both
can be addressed by extracting the features at the encoder and
transmitting them to the decoder along with the audio. However, this
method has the same disadvantage as channel coding in that an extra
payload is required. This extra payload not only uses up more
bandwidth, but also necessarily modifies the audio format if neither a
common area nor a user data area is available. This format change
makes the audio no longer decodable by an ordinary decoder.

In this work, we apply data hiding techniques [1] to embed these
enhancement features for error concealment of MPEG-2 AAC audio [5, 6].
Specifically, a novel modulo watermarking scheme is deployed for the
first time to hide the enhancement features. Modulo watermarking,
which extracts the hidden data as the modulo of the sum of the
watermarked integer signal samples, is an example of one-to-many
embedding schemes. In other words, several different watermarked
signals can contain the same hidden data. This property gives the
watermark encoder freedom in selecting a watermarked signal with small
perceptual distortion.

Portions of the AAC encoded audio such as audio headers are naturally
more important than the others. When the encoded audio is transmitted
via a noisy channel (e.g., the Internet), unequal error protection is
usually applied to ensure almost no corruption of these portions. In
this work, we assume the headers are very well protected and can be
fully recovered. However, frequency coefficients, which are less
important, may be lost during transmission. When this happens, we
extract the enhancement features from embedded watermarks and use them
for error concealment.

After the modulo watermark is embedded, an MPEG-2 AAC audio file shows
only a negligible increase in file size (< 0.1 %) and a moderate SNR
penalty (< 0.7 dB). For audio transmission over packet-switched
networks, our watermark-based concealment scheme shows a consistent
SNR gain over conventional concealment schemes based on zero
replacement or frame duplication.

2.  AAC  ERROR  CONCEALMENT 
2.1.  Advanced  Audio  Coding  (AAC)

AAC [6], which is included in the MPEG-2 audio standard, is the
non-backward compatible successor of MPEG-1 Layer 3 audio coding
(MP3). AAC encoding consists of four steps: frequency transform,
quantization, entropy coding, and bitstream multiplexing. AAC employs
the modified discrete cosine transform (MDCT), typically with 1024
samples per frame. The 1024 frequency samples in each time frame are
separated into 49 frequency bands. Within the same frequency band,
samples are considered to have a similar perceptual effect on the
human ear and hence share the same quantization step size. Perceptual
modeling is applied to the MDCT coefficients to estimate the maximum
amount of distortion that can be withstood by each coefficient. The
quantization step size is iteratively modified until both the rate is
below the target bit rate and the distortion is below the maximum
acceptable value obtained from the perceptual model. Huffman coding is
used to encode the quantized coefficients and the quantization step
size. Finally, the coded indices are multiplexed into a single
bitstream.

2.2.  Proposed  Error  Concealment  Scheme

Since a coefficient is most effectively estimated by its nearest
neighbors, ideally, adjacent coefficients along both the time and
frequency axes should not be packed together, because the sources of
estimation will be lost as well when the packet is dropped. However,
we do not impose this as a requirement of our scheme, because we aim
to overlay our scheme on any other error control scheme [7].

As coefficients inside a frequency band share similar perceptual
behavior, we choose to group them together for estimation. Denote by
$(n, i)$-band the $i$th band at the $n$th time frame, and assume the
coefficients $b[n, k]$ in the $(n, i)$-band are lost, where
$k \in \mathcal{K}_i$ and $\mathcal{K}_i$ is the index set of the
$i$th band. We estimate $b[n, k]$ by one of four candidates:

$$\hat{b}_0[n,k] = 0, \quad \hat{b}_1[n,k] = b[n-1,k], \quad \hat{b}_2[n,k] = b[n+1,k], \quad \hat{b}_3[n,k] = \tfrac{1}{2}\left(b[n-1,k] + b[n+1,k]\right).$$

Among the above four choices, we define $c[n, i]$ as the index that
minimizes the mean square error. That is,

$$c[n,i] = \arg\min_{c \in \{0,1,2,3\}} \sum_{k \in \mathcal{K}_i} \left(b[n,k] - \hat{b}_c[n,k]\right)^2.$$

$c[n, i]$ is pre-computed and embedded into the original AAC audio.
Embedding $c[n, i]$ into the $(n, i)$-band itself will not work,
because when we need this information the band is lost, and so is
$c[n, i]$. We split $c[n, i]$ into two bits and embed them separately
in the two neighboring bands.

Define

$$d[n,i] = \begin{cases}
0, & \text{if } c[n-1,i] \in \{0,1\} \wedge c[n+1,i] \in \{0,2\}, \\
1, & \text{if } c[n-1,i] \in \{2,3\} \wedge c[n+1,i] \in \{0,2\}, \\
2, & \text{if } c[n-1,i] \in \{0,1\} \wedge c[n+1,i] \in \{1,3\}, \\
3, & \text{if } c[n-1,i] \in \{2,3\} \wedge c[n+1,i] \in \{1,3\}.
\end{cases}$$

The higher and the lower bit of $d[n, i]$ tell whether the current
band is suitable for estimating the band in the next time frame
($(n+1, i)$-band) and in the last time frame ($(n-1, i)$-band),
respectively.

For example, suppose the $(n, i)$-band is lost. From the lower bit of
$d[n+1, i]$ and the higher bit of $d[n-1, i]$, we can determine
whether the current band should be estimated from any of its
neighbors. When it is estimated from both sides, the sum of the two
neighbors is scaled by 1/2. If one of its neighbors is lost, we
estimate the current band from the remaining neighbor. If both
neighbors are lost, then we assume $c[n, i] = 0$ and replace the
coefficients by zeros.

3.  WATERMARK  EMBEDDING  OF  ENHANCEMENT  INFORMATION

3.1.  Choices  of  Watermarking  Schemes

Watermarking schemes [8, 9, 10] can be categorized into two classes:
robust watermarking [9] and fragile watermarking [11, 12]. Robust
watermarking is designed to withstand common signal processing attacks
but has a low embedding rate (< 10 bits/sec for audio), while fragile
watermarking is sensitive to any modification but has a much higher
embedding rate (~ 1000 bits/sec for audio).

Since there are two bits for each $d[n, i]$ and one $d[n, i]$ per
band, for a dual channel audio clip with a sampling rate of 44100 Hz,
the embedding rate is about
$44100/1024 \times 49 \times 2 \times 2 \approx 8$ kbits/sec, which is
too high for robust watermarking. Therefore, fragile watermarking is
the only possible option.
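The arithmetic behind this rate can be checked with a few lines of Python. This is only a sketch: the function and parameter names are ours, and the 256 kbits/sec figure is the example bitrate quoted in Section 4.1.

```python
def embedding_rate_bps(sample_rate=44100, frame_len=1024, bands=49,
                       bits_per_band=2, channels=2):
    """Bits per second needed to carry one d[n, i] per band and channel."""
    frames_per_second = sample_rate / frame_len
    return frames_per_second * bands * bits_per_band * channels


def overhead_share(rate_bps, audio_bitrate_bps=256_000):
    """Fraction of the audio bitrate an explicit side channel would take."""
    return rate_bps / audio_bitrate_bps


if __name__ == "__main__":
    rate = embedding_rate_bps()
    print(f"embedding rate: {rate:.0f} bits/sec")
    print(f"share of a 256 kbits/sec stream: {overhead_share(rate):.1%}")
```

The result is slightly above 8 kbits/sec, which is why the text rounds it to "about 8 kbits/sec".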

A typical fragile watermarking scheme is least bit modulation (LBM).
One can embed a bit into a host signal by simply replacing the least
significant bit of one signal sample by the embedding bit. The
information embedding rate of LBM can be very high. For example, if we
embed a bit into each sample of dual channel audio with a sampling
rate of 44100 Hz, the embedding rate is up to
$44100 \times 2 \approx 88$ kbits/sec in theory. However, since only
the least significant bit is modified, the watermark can be removed
easily by truncating the last bit. Fortunately, unlike in copyright
protection applications, deliberate attacks on our watermark are
unlikely.
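As a minimal illustration of LBM on integer samples, the sketch below embeds one bit per sample and shows how truncating the last bit erases it. The helper names are hypothetical and not taken from any codec or library.

```python
def lbm_embed(sample: int, bit: int) -> int:
    """Replace the least significant bit of an integer sample by `bit`."""
    return (sample & ~1) | (bit & 1)


def lbm_extract(sample: int) -> int:
    """Read the embedded bit back from the least significant bit."""
    return sample & 1


def truncate_last_bit(sample: int) -> int:
    """The trivial 'attack': clear the least significant bit."""
    return sample & ~1


if __name__ == "__main__":
    marked = lbm_embed(10, 1)
    print(marked, lbm_extract(marked))            # sample carries the bit
    print(lbm_extract(truncate_last_bit(marked)))  # bit is gone after truncation
```

Each sample changes by at most one quantization level, but the embedding position is fixed in advance, which is the limitation the next section addresses.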

3.2.  Fragile  Modulo  Watermarking 

Since different signal samples may have different susceptibilities to
distortion, we should adaptively select the embedding locations.
However, for LBM, both the encoder and the decoder have to agree upon
predefined embedding locations, because there is no side-information
to tell the decoder the embedding locations. Note that this may not be
a problem for some other applications in which a key is available for
decoding, because the key itself can serve as the side-information.
However, for the error concealment problem, it is not reasonable to
require a user to provide a "key" before enhancement is performed.

To enable flexible encoding, we propose a novel fragile watermarking
technique that does not require the decoder to know the exact
embedding locations. Let $x = x_1, x_2, \ldots, x_N$ be an arbitrary
integer host signal sequence. We embed an integer $k \in [0, K)$ by
enforcing the following:

$$\sum_{i=1}^{N} x_i \equiv k \pmod{K}.$$

Note that LBM is a special case of modulo watermarking when $N = 1$
and $K = 2$.

There is more than one possible watermarked signal containing the same
embedded information. The encoder has the freedom of choosing the
locations of modifications so that the watermarked signal is
perceptually closest to the original. Despite that, the decoder does
not need to know the locations where modifications have been made.
3.3.  Enhancement  Information  Embedding  using  Modulo  Watermarking

One limitation in applying our fragile watermarking is that it can
only be deployed after quantization; otherwise the watermark will be
destroyed. Moreover, since it is very hard to embed a watermark into a
Huffman coded signal, we embed the enhancement features into the
quantization indices, which are obtained after partial decoding. After
watermarking, the modified indices are encoded using Huffman coding
with the original codebook.

With the freedom of embedding given by modulo watermarking, the
remaining question is which indices should be modified and by how
much. Ideally, this can be decided by applying perceptual modeling to
the original audio. For example, if we know that one coefficient can
afford a distortion of 10 units and its current quantization step size
is 2 units, then we can vary the corresponding index by approximately
5 steps without affecting the quality (footnote 1).

However, the perceptual model may not be accessible, because the file
has already been compressed. Although we can estimate the model
parameters from the decompressed audio, the estimation is not accurate
in general. Therefore, we employ the following heuristic approach,
which does not use the perceptual model:

To embed $d[n, i]$ into the quantization indices $q[n, k]$ of the
$(n, i)$-band, $k \in \mathcal{K}_i$, let

$$l = \Big(\sum_{k \in \mathcal{K}_i} q[n,k] - d[n,i]\Big) \bmod K,$$

where $K$ is the number of different values that can be embedded.

In this work, we pick $K$ as four. If $l = 0$, the sum of the indices
already equals $d[n, i]$ modulo $K$ and nothing needs to be changed.
Assuming $0 < l \le K/2 = 2$, we embed the modulo watermark $d[n, i]$
in the following three steps:

1. Among all indices that lie within the range $[I_{\min}, I_{\max}]$,
select the $l$ largest in magnitude.

2. Declare embedding failure and leave the indices unchanged if fewer
than $l$ indices can be found in step 1.

3. Subtract 1 from each of those indices.

If $4 > l > 2$, replace $l$ by $4 - l$ and carry out the same steps,
except that in the last step 1 is added to each selected index instead
of subtracted.
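The steps above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: we assume the range test applies to the magnitude of signed indices, and `i_max=15` is a placeholder, not a value taken from an actual AAC Huffman table.

```python
def extract_modulo(q, K=4):
    """Decoder: recover d from the sum of the band's indices modulo K."""
    return sum(q) % K


def embed_modulo(q, d, K=4, i_min=1, i_max=15):
    """Return (indices, success) after trying to make sum(indices) % K == d.

    Assumption: an index is eligible if i_min <= |index| <= i_max.
    """
    l = (sum(q) - d) % K
    if l == 0:                       # the sum already carries d
        return list(q), True
    if l <= K // 2:                  # lower the sum by l
        count, delta = l, -1
    else:                            # raise the sum by K - l instead
        count, delta = K - l, +1
    # Step 1: eligible indices, largest magnitude first.
    eligible = [j for j, v in enumerate(q) if i_min <= abs(v) <= i_max]
    eligible.sort(key=lambda j: abs(q[j]), reverse=True)
    # Step 2: embedding failure, indices stay untouched.
    if len(eligible) < count:
        return list(q), False
    # Step 3: change each selected index by one.
    out = list(q)
    for j in eligible[:count]:
        out[j] += delta
    return out, True


if __name__ == "__main__":
    marked, ok = embed_modulo([5, -3, 0, 2, 1], d=2)
    print(marked, ok, extract_modulo(marked))
```

Because at most two indices move by one step when K = 4, the distortion per band stays small, and a failed embedding simply leaves the band unmodified.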

Since the enhancement features (EFs) are independently stored, they
are useful even when only a fraction of them is retrieved correctly.
Therefore, embedding failure in the scheme is acceptable.

Footnote 1: Uniform quantization is assumed in the simplified example
above.

The lower limit $I_{\min}$ in the first step restrains modification of
small-valued indices, because they are more likely to be highly
susceptible to distortion. In particular, no distortion should be
imposed on zero indices. $I_{\min}$ also serves as a design parameter
for trading off error-free distortion against error concealment
capability. As $I_{\min}$ increases, it is more likely that the
embedding of $d[n, i]$ fails and leaves the indices with no
distortion. However, inaccurate $d[n, i]$'s will make the error
concealment process less effective. In our experiment, $I_{\min}$ is
simply set to 1.

$I_{\max}$ is equal to the maximum possible value available in the
Huffman table less 1. This prevents indices from being out of bound
after modification. Large indices are selected for modification
because they can withstand larger distortion.

4.  EXPERIMENTAL  RESULTS 

4.1.  Increase  of  Audio  File  Size  After  Watermark  Embedding

The Huffman codebook used in the original audio is optimized in the
AAC encoder. Since we modify the indices but keep the old codebook, it
is expected that the size of the compressed file will increase after
watermark embedding. However, the increase should be small because we
only change relatively few indices. Table 1 indeed confirms this: the
size increase is less than 0.1 % over all test audio clips.

In contrast, from Section 3.1, we need 8 kbits/sec if an explicit
overhead is written to the audio. This corresponds to 8/256 ≈ 3 % of
the total file size for audio encoded at 256 kbits/sec.


clip1    clip2    clip3    clip4    clip5    clip6    clip7
0.02%    0.02%    0.06%    0.01%    0.03%    0.06%    0.06%

Table 1: Percentage change in audio clip size after watermarking.


4.2.  Audio  Quality  after  Watermark  Embedding 

After modulo watermarks are embedded into an AAC audio file, we expect
the quality of the decoded audio clip to deteriorate somewhat.
However, our test shows that the perceptual quality of the watermarked
audio clips is acceptable in an office or lab environment. As an
objective measure, we compare the SNR of each AAC coded audio clip
before and after watermark embedding. The SNR decrease due to
watermark embedding is between 0.03 dB and 0.68 dB (Table 2).
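The text does not spell out how the SNR is computed. A common definition, which we assume here purely for illustration, is the ratio of reference signal energy to error energy in decibels:

```python
import math


def snr_db(reference, degraded):
    """SNR in dB of `degraded` relative to `reference` (assumed definition)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    if noise == 0:
        return math.inf              # identical signals
    if signal == 0:
        return -math.inf             # silent reference, non-zero error
    return 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    print(snr_db([3.0, 4.0], [3.0, 3.0]))
```

Under this definition, the SNR change reported for a clip is the difference between the SNR of the plain AAC decode and the SNR of the watermarked decode, both measured against the same uncompressed reference.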

         AAC audio    After watermarking    SNR change
clip1      32.87            32.69              0.18
clip2      18.18            17.95              0.23
clip3      17.13            17.10              0.03
clip4      31.50            31.29              0.21
clip5      28.66            27.99              0.67
clip6      24.47            23.79              0.68
clip7      26.73            26.69              0.04

Table 2: SNR change (in dB) after embedding enhancement information.

4.3.  Error  Concealment  Results

We assume the AAC audio coefficients are packetized and transmitted
via a noisy channel. Each packet consists of coefficients from one
time frame. A packet is either correctly received or lost. A periodic
packet loss is assumed in our simulation with a fixed packet loss
ratio. We compare our scheme with two reference schemes (Ref.1 [13]
and Ref.2 [14]). In Ref.1, all lost coefficients are set to 0. In
Ref.2, the previous adjacent time frame is copied to the current lost
one (Table 3).

Packet loss ratio     0.01     0.02     0.05     0.1      0.2
clip1    Ours        22.79    20.99    15.80    13.25     9.91
         Ref.1       20.92    16.99    12.90    10.01     6.74
         Ref.2       18.60    15.06    10.63     7.61     4.24
clip2    Ours        16.93    15.94    13.92    11.80     9.49
         Ref.1       16.01    14.56    11.87     9.47     6.82
         Ref.2       15.02    13.01     9.90     7.20     4.42
clip3    Ours        16.12    15.23    13.06    11.16     8.65
         Ref.1       15.73    14.39    11.81     9.50     6.87
         Ref.2       14.41    12.49     9.36     6.71     3.92
clip4    Ours        23.74    19.62    15.27    12.42     9.55
         Ref.1       20.64    17.37    12.88     9.99     6.98
         Ref.2       17.18    14.22    10.15     7.25     4.09
clip5    Ours        23.93    21.20    14.91    12.63     9.30
         Ref.1       22.17    18.75    12.73    10.35     6.92
         Ref.2       19.35    15.08    10.13     7.67     4.53
clip6    Ours        20.73    18.82    16.81    13.62    10.59
         Ref.1       19.99    17.06    13.17    10.57     7.19
         Ref.2       16.73    14.19     9.18     6.61     3.19
clip7    Ours        23.33    21.10    15.19    13.26     9.87
         Ref.1       20.07    17.46    12.16     9.97     7.05
         Ref.2       18.82    15.87     8.59     6.26     3.36

Table 3: SNR comparison (in dB) of three different error concealment
schemes: our scheme (Ours), zero replacement (Ref.1), and blind
duplication of the previous time frame (Ref.2).

Our watermark-based concealment scheme gives higher SNR than Ref.1 and
Ref.2 in all cases. The slight drop in SNR due to watermark embedding
is quickly offset by the gain obtained from our concealment scheme,
even at a small packet loss ratio of 0.01. Moreover, the gain becomes
more conspicuous as the packet loss ratio increases.
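To make the comparison concrete, the following sketch runs the three concealment strategies on synthetic single-band data with periodic frame loss. It is not the authors' simulation: the data, function names, and loss pattern (every `round(1 / loss_ratio)`-th frame) are our assumptions, the hidden $d[n, i]$ values are taken to be extracted without error, and both neighbors of a lost frame are assumed to be received.

```python
import random


def candidates(prev, nxt):
    """The four candidate estimates of a lost band (Section 2.2)."""
    return [
        [0.0] * len(prev),                          # c = 0: zeros
        list(prev),                                 # c = 1: previous frame
        list(nxt),                                  # c = 2: next frame
        [(p + q) / 2 for p, q in zip(prev, nxt)],   # c = 3: average
    ]


def sq_err(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def choose_c(prev, cur, nxt):
    """Encoder: candidate index with the smallest squared error."""
    errs = [sq_err(cur, est) for est in candidates(prev, nxt)]
    return errs.index(min(errs))


def pack_d(c_prev, c_next):
    """d[n]: low bit set if frame n-1 uses frame n, high bit if frame n+1 does."""
    low = 1 if c_prev in (2, 3) else 0
    high = 1 if c_next in (1, 3) else 0
    return (high << 1) | low


def conceal(prev, nxt, d_prev, d_next):
    """Decoder: rebuild c for the lost frame from its neighbours' d values."""
    use_prev = (d_prev >> 1) & 1     # high bit of d[n-1]
    use_next = d_next & 1            # low bit of d[n+1]
    return candidates(prev, nxt)[use_prev + 2 * use_next]


def demo(n_frames=21, band_size=8, loss_ratio=0.2, seed=0):
    """Total squared error of the three schemes over periodically lost frames."""
    rng = random.Random(seed)
    frames = [[rng.gauss(0, 1) for _ in range(band_size)]]
    for _ in range(n_frames - 1):    # slowly drifting synthetic coefficients
        frames.append([v + rng.gauss(0, 0.3) for v in frames[-1]])
    last = n_frames - 1
    c = [0] * n_frames
    for n in range(1, last):
        c[n] = choose_c(frames[n - 1], frames[n], frames[n + 1])
    d = [0] * n_frames
    for n in range(1, last):
        d[n] = pack_d(c[n - 1], c[n + 1])
    period = round(1 / loss_ratio)   # assumed >= 2 so neighbours survive
    totals = {"ours": 0.0, "zero": 0.0, "dup": 0.0}
    for n in range(period, last - 1, period):
        est = conceal(frames[n - 1], frames[n + 1], d[n - 1], d[n + 1])
        totals["ours"] += sq_err(frames[n], est)
        totals["zero"] += sq_err(frames[n], [0.0] * band_size)
        totals["dup"] += sq_err(frames[n], frames[n - 1])
    return totals


if __name__ == "__main__":
    print(demo())
```

Since zero replacement and previous-frame duplication are two of the four candidates, the selected estimate can never have a larger squared error than either reference on a lost band, which is consistent with the ordering seen in Table 3.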

5.  CONCLUSION 

We have proposed an error concealment scheme for AAC audio using
digital watermarking, which can be overlaid on other error control
schemes effectively. A novel modulo watermarking technique is
described and incorporated into our scheme. After the modulo watermark
is embedded, an MPEG-2 AAC audio file shows only a negligible increase
in file size and a moderate SNR penalty. For audio transmission over
packet-switched networks, our watermark-based concealment scheme shows
a consistent SNR gain over conventional concealment schemes. Although
the simulations in this paper are done on AAC audio, our scheme can be
easily extended to other media formats.